COMBINING INFORMATION FROM INDEPENDENT SOURCES
THROUGH CONFIDENCE DISTRIBUTIONS

By Kesar Singh, Minge Xie and William E. Strawderman 

Rutgers University 

This paper develops new methodology, together with related the- 
ories, for combining information from independent studies through 
confidence distributions. A formal definition of a confidence distri- 
bution and its asymptotic counterpart (i.e., asymptotic confidence 
distribution) are given and illustrated in the context of combining 
information. Two general combination methods are developed: the 
first along the lines of combining p-values, with some notable dif- 
ferences in regard to optimality of Bahadur type efficiency; the sec- 
ond by multiplying and normalizing confidence densities. The latter 
approach is inspired by the common approach of multiplying like- 
lihood functions for combining parametric information. The paper 
also develops adaptive combining methods, with supporting asymp- 
totic theory which should be of practical interest. The key point of the 
adaptive development is that the methods attempt to combine only 
the correct information, downweighting or excluding studies contain- 
ing little or wrong information about the true parameter of interest. 
The combination methodologies are illustrated in simulated and real 
data examples with a variety of applications. 

1. Introduction and motivations. Point estimators, confidence intervals 
and p-values have long been fundamental tools for frequentist statisticians.
Confidence distributions (CDs), which can be viewed as "distribution es- 
timators," are often convenient devices for constructing all the above sta- 
tistical procedures plus more. The basic notion of CDs traces back to the 
fiducial distribution of Fisher (1930); however, it can be viewed as a pure 
frequentist concept. Indeed, as pointed out in Schweder and Hjort (2002), 
the CD concept is a "Neymannian interpretation of Fisher's fiducial
distribution" [Neyman (1941)]. Its development has proceeded from Fisher (1930)
through recent contributions, just to name a few, of Efron (1993, 1998),
Fraser (1991, 1996), Lehmann (1993), Schweder and Hjort (2002) and oth- 
ers. There is renewed interest in CDs [Schweder and Hjort (2002)], partly 
because "statisticians will be asked to solve bigger and more complicated 
problems" [Efron (1998)] and the development of CDs might hold a key to 
"our profession's 250-year search for a dependable objective Bayes theory" 
[Efron (1998) and Schweder and Hjort (2002)]. 

This paper is mainly focused on some new developments on the "com- 
bination" aspect of CDs, where two natural approaches of combining CD 
information from independent studies are considered. The first approach is 
from the p-value combination scheme which dates back to Fisher (1932); 
see also Littell and Folks (1973) and Marden (1991), among many others. 
The second approach is analogous to multiplying likelihood functions in 
parametric inference. The two approaches are compared in the case of com- 
bining asymptotic normality based CDs. We require the resulting function 
of combined CDs to be a CD (or an asymptotic CD) so that it can be used 
later on to make inferences, store information or combine information in a 
sequential way. 

For this purpose, we adopt a formal definition of CD developed by and
presented in Schweder and Hjort (2002), and extend it to obtain a formal
definition of asymptotic confidence distributions (aCDs). Suppose
$X_1, X_2, \ldots, X_n$ are n independent random draws from a population F
and $\mathcal{X}$ is the sample space corresponding to the data set
$\mathbf{X}_n = (X_1, X_2, \ldots, X_n)^T$. Let $\theta$ be a parameter of
interest associated with F (F may contain other nuisance parameters), and
let $\Theta$ be the parameter space.

Definition 1.1. A function $H_n(\cdot) = H_n(\mathbf{X}_n, \cdot)$ on
$\mathcal{X} \times \Theta \to [0,1]$ is called a confidence distribution
(CD) for a parameter $\theta$ if (i) for each given
$\mathbf{X}_n \in \mathcal{X}$, $H_n(\cdot)$ is a continuous cumulative
distribution function; (ii) at the true parameter value
$\theta = \theta_0$, $H_n(\theta_0) = H_n(\mathbf{X}_n, \theta_0)$, as a
function of the sample $\mathbf{X}_n$, has the uniform distribution U(0, 1).

The function $H_n(\cdot)$ is called an asymptotic confidence distribution
(aCD) if requirement (ii) above is replaced by (ii)': at
$\theta = \theta_0$, $H_n(\theta_0) \xrightarrow{W} U(0,1)$ as
$n \to +\infty$, and the continuity requirement on $H_n(\cdot)$ is dropped.

We call, when it exists, $h_n(\theta) = H_n'(\theta)$ a CD density, also
known as a confidence density in the literature. This CD definition is the
same as in Schweder and Hjort (2002), except that we suppress possible
nuisance parameter(s) for notational simplicity. Our version, which was
developed independently of Schweder and Hjort (2002), was motivated by the
observation (1.1) below. For every $\alpha$ in (0,1), let
$(-\infty, \xi_n(\alpha)]$ be a $100\alpha\%$ lower-side confidence
interval, where $\xi_n(\alpha) = \xi_n(\mathbf{X}_n, \alpha)$ is continuous
and increasing in $\alpha$ for each sample $\mathbf{X}_n$. Then
$H_n(\cdot) = \xi_n^{-1}(\cdot)$ is a CD in the usual Fisherian sense. In
this case,

(1.1) $\{\mathbf{X}_n : H_n(\theta) \le \alpha\} = \{\mathbf{X}_n : \theta \le \xi_n(\alpha)\}$

for any $\alpha$ in (0, 1) and $\theta$ in $\Theta \subseteq \mathbb{R}$.

Thus, at $\theta = \theta_0$, $\Pr\{H_n(\theta_0) \le \alpha\} = \alpha$
and $H_n(\theta_0)$ is U(0, 1) distributed.
Definition 1.1 is very convenient for the purpose of verifying if a particular 
function is a CD or an aCD. 
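
Requirement (ii) of Definition 1.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the z-based CD $H_n(y) = \Phi(\sqrt{n}(y - \bar X)/\sigma)$ for a normal mean with known variance, and the function names, parameter values and random seed are hypothetical choices rather than anything prescribed by the paper.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def z_cd(y, xbar, sigma, n):
    """CD H_n(y) = Phi(sqrt(n) * (y - xbar) / sigma) for a normal mean, sigma known."""
    return STD_NORMAL.cdf((n ** 0.5) * (y - xbar) / sigma)

def simulate_cd_at_truth(theta0=1.0, sigma=2.0, n=25, reps=20000, seed=0):
    """Return H_n(theta0) for `reps` independent simulated studies."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        # The mean of n draws from N(theta0, sigma^2) is N(theta0, sigma^2 / n).
        xbar = rng.gauss(theta0, sigma / n ** 0.5)
        values.append(z_cd(theta0, xbar, sigma, n))
    return values

values = simulate_cd_at_truth()
mean_value = fmean(values)                      # close to 1/2 if uniform
frac_below = sum(v <= 0.10 for v in values) / len(values)  # close to 0.10 if uniform
print(round(mean_value, 3), round(frac_below, 3))
```

If $H_n(\theta_0)$ is U(0, 1), the average should be near 1/2 and the fraction of values below 0.10 should be near 0.10, up to simulation error.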

The notion of a CD (or aCD) is attractive for the purpose of combining 
information. The main reasons are that there is a wealth of information on 
$\theta$ inside a CD, the concept of CD (and particularly aCD) is quite broad, and
the CDs are relatively easy to construct and interpret. Section 2 provides a 
brief review of materials related to the CDs along these views. See Schweder 
and Hjort (2002) for an expanded discussion of the concept of CDs and the 
information contained in CDs. 

The main developments are in Sections 3 and 4. We provide in Section 3 
a general recipe by adopting a general p-value combination scheme. Sec- 
tion 3.1 derives an optimal method for combining CDs associated with the 
same parameter, where the optimality is in terms of the Bahadur slope. The 
optimal scheme is notably different from that for combining p-values.
Section 3.2 proposes adaptive combination methods, in the setting where the
parameter values in some of the prior studies are not necessarily the same as 
the parameter value in the current study. The properties of adaptive consis- 
tency and adaptive efficiency are discussed. Analogous to combining likeli- 
hood functions in likelihood inference, we study in Section 4 a combination 
approach of multiplying CD densities. There we also provide a comparison 
of the two different CD-combining approaches in the case of normal type 
aCDs. Section 5 illustrates the methodology through three examples, each 
of which has individual significance. The proofs are in the Appendix. 

2. Examples and inferential information contained in a CD. The notion 
of CDs and aCDs covers a broad range of examples, from regular parametric 
cases to p-value functions, normalized likelihood functions, bootstrap distri- 
butions and Bayesian posteriors, among others. 

Example 2.1. Normal mean and variance. Suppose $X_1, X_2, \ldots, X_n$ is
a sample from $N(\mu, \sigma^2)$, with both $\mu$ and $\sigma^2$ unknown.
A CD for $\mu$ is
$H_n(y) = F_{t_{n-1}}\bigl(\frac{y - \bar X}{s_n/\sqrt{n}}\bigr)$, where
$\bar X$ and $s_n^2$ are, respectively, the sample mean and variance, and
$F_{t_{n-1}}(\cdot)$ is the cumulative distribution function of the
Student $t_{n-1}$-distribution. A CD for $\sigma^2$ is
$H_n(y) = 1 - F_{\chi^2_{n-1}}\bigl(\frac{(n-1)s_n^2}{y}\bigr)$ for
$y > 0$, where $F_{\chi^2_{n-1}}(\cdot)$ is the cumulative distribution
function of the $\chi^2_{n-1}$-distribution.

Example 2.2. p-value function. For any given $\tilde\theta$, let
$p_n(\tilde\theta) = p_n(\mathbf{X}_n, \tilde\theta)$ be a p-value for a
one-sided test $K_0: \theta \le \tilde\theta$ versus
$K_1: \theta > \tilde\theta$. Assume that the p-value is available for all
$\tilde\theta$. The function $p_n(\cdot)$ is called a p-value function
[Fraser (1991)]. Typically, at the true value $\theta = \theta_0$,
$p_n(\theta_0)$ as a function of $\mathbf{X}_n$ is exactly (or
asymptotically) U(0, 1)-distributed. Also, $H_n(\cdot) = p_n(\cdot)$ for
every fixed sample is almost always a cumulative distribution function.
Thus, usually $p_n(\cdot)$ satisfies the requirements for a CD (or aCD).

Example 2.3. Likelihood functions. There is a connection between the 
concepts of aCD and various types of likelihood functions, including like- 
lihood functions in single parameter families, profile likelihood functions, 
Efron's implied likelihood function and Schweder and Hjort's reduced like- 
lihood function, and so on. In fact, one can easily conclude from Theorems 
1 and 2 of Efron (1993) that in an exponential family, both the profile like- 
lihood and the implied likelihood [Efron (1993)] are aCD densities after a 
normalization. Singh, Xie and Strawderman (2001) provided a formal proof,
with some specific conditions, which shows that $e^{\ell_n^*(\theta)}$ is
proportional to an aCD density for the parameter $\theta$, where
$\ell_n^*(\theta) = \ell_n(\theta) - \ell_n(\hat\theta)$, $\ell_n(\theta)$
is the log-profile likelihood function, and $\hat\theta$ is the maximum
likelihood estimator of $\theta$. Schweder and Hjort (2002) proposed the
reduced likelihood function,
which itself is proportional to a CD density for a specially transformed 
parameter. Also see Welch and Peers (1963) and Fisher (1973) for earlier 
accounts of likelihood function based CDs in single parameter families. 

Example 2.4. Bootstrap distribution. Let $\hat\theta$ be a consistent
estimator of $\theta$. In the basic bootstrap methodology the distribution
of $\hat\theta - \theta$ is estimated by the bootstrap distribution of
$\hat\theta_B - \hat\theta$, where $\hat\theta_B$ is the estimator
$\hat\theta$ computed on a bootstrap sample. An aCD for $\theta$ is
$H_n(y) = P_B(\hat\theta_B \ge 2\hat\theta - y) = 1 - P_B(\hat\theta_B - \hat\theta \le \hat\theta - y)$,
where $P_B(\cdot)$ is the probability measure induced by bootstrapping. As
$n \to \infty$, the limiting distribution of normalized $\hat\theta$ is
often symmetric. In this case, due to the symmetry, the raw bootstrap
distribution $H_n(y) = P_B(\hat\theta_B \le y)$ is also an aCD for $\theta$.
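
A minimal sketch of the bootstrap aCD for a population mean follows, with the bootstrap probability replaced by a frequency over resamples. The helper names, the number of resamples and the simulated data are assumptions made only for illustration.

```python
import random
from statistics import fmean

def bootstrap_means(sample, n_boot=2000, seed=0):
    """Sample means recomputed on n_boot bootstrap resamples (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(sample)
    return [fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]

def bootstrap_acd(y, theta_hat, boot_estimates):
    """H_n(y) = P_B(theta_B >= 2 * theta_hat - y), estimated by a bootstrap frequency."""
    cutoff = 2.0 * theta_hat - y
    return sum(b >= cutoff for b in boot_estimates) / len(boot_estimates)

# Hypothetical data: 50 draws from N(0, 1).
data_rng = random.Random(42)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
theta_hat = fmean(sample)
boot = bootstrap_means(sample)
print(bootstrap_acd(theta_hat, theta_hat, boot))  # near 1/2 at the point estimate
```

By construction the function is nondecreasing in y and moves from 0 to 1 as y crosses the range of the reflected bootstrap estimates.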

Other examples include a second-order accurate CD of the population 
mean based on Hall's [Hall (1992)] second-order accurate transformed t- 
statistic, an aCD of the correlation coefficient based on Fisher's z-score 
function, among many others. See Schweder and Hjort (2002) for more ex- 
amples and extended discussion. 

A CD contains a wealth of information, somewhat comparable to, but 
different than, a Bayesian posterior distribution. A CD (or aCD) derived 
from a likelihood function can also be interpreted as an objective Bayesian
posterior. We give a brief summary below of information in a CD related to 
some basic elements of inference. The reader can find more details in Singh, 
Xie and Strawderman (2001). This information is also scattered around in 
earlier publications, for example, in Fisher (1973), Fraser (1991, 1996) and 
Schweder and Hjort (2002), among others. 

• Confidence interval. From the definition, it is evident that the intervals
$(-\infty, H_n^{-1}(1-\alpha)]$, $[H_n^{-1}(\alpha), +\infty)$ and
$(H_n^{-1}(\alpha/2), H_n^{-1}(1-\alpha/2))$ provide
$100(1-\alpha)\%$-level confidence intervals of different kinds for
$\theta$, for any $\alpha \in (0,1)$. The same is true for an aCD, where
the confidence level is achieved in limit.

• Point estimation. Natural choices of point estimators of the parameter
$\theta$, given $H_n(\theta)$, include the median $M_n = H_n^{-1}(1/2)$,
the mean $\bar\theta_n = \int_{-\infty}^{+\infty} t\,dH_n(t)$ and the
maximum point of the CD density
$\hat\theta_n = \arg\max_\theta h_n(\theta)$, $h_n(\theta) = H_n'(\theta)$.
Under some modest conditions one can prove that these point estimators are
consistent, among other properties.

• Hypothesis testing. From a CD, one can obtain p-values for various
hypothesis testing problems. Fraser (1991) developed some results on such a
topic through p-value functions. The natural line of thinking is to measure
the support that $H_n(\cdot)$ lends to a null hypothesis
$K_0: \theta \in C$. We perceive two types of support:
1. Strong-support $p_s(C) = \int_C dH_n(\theta)$.
2. Weak-support $p_w(C) = \sup_{\theta \in C} 2\min(H_n(\theta), 1 - H_n(\theta))$.
If $K_0$ is of the type $(-\infty, \theta_0]$ or $[\theta_0, \infty)$ or a
union of finitely many intervals, the strong-support $p_s(C)$ leads to the
classical p-values. If $K_0$ is a singleton, that is, $K_0$ is
$\theta = \theta_0$, then the weak-support $p_w(C)$ leads to the classical
p-values.
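
The three uses of a CD listed above can be sketched in a few lines. The example assumes the z-based CD for a normal mean with known variance; the bisection helper, the search bounds and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def invert_cd(cd, p, lo, hi, tol=1e-10):
    """Solve cd(y) = p by bisection, assuming cd is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cd(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_sided_interval(cd, alpha, lo, hi):
    """(H^{-1}(alpha/2), H^{-1}(1 - alpha/2)): a 100(1 - alpha)% interval."""
    return invert_cd(cd, alpha / 2, lo, hi), invert_cd(cd, 1 - alpha / 2, lo, hi)

def strong_support_left(cd, theta0):
    """p_s(C) for C = (-inf, theta0]: the CD mass on C, i.e. cd(theta0)."""
    return cd(theta0)

def weak_support_point(cd, theta0):
    """p_w(C) for the singleton C = {theta0}: 2 * min(H, 1 - H)."""
    h = cd(theta0)
    return 2.0 * min(h, 1.0 - h)

# Hypothetical study: sample mean 0.3, sigma = 1, n = 40.
xbar, sigma, n = 0.3, 1.0, 40
cd = lambda y: STD_NORMAL.cdf(n ** 0.5 * (y - xbar) / sigma)

median = invert_cd(cd, 0.5, -10.0, 10.0)
interval = two_sided_interval(cd, 0.05, -10.0, 10.0)
p_two_sided = weak_support_point(cd, 0.0)   # test of theta = 0
p_one_sided = strong_support_left(cd, 0.0)  # test of theta <= 0
```

For this CD the median is the sample mean, the interval is the usual z-interval, and the weak support at a point coincides with the classical two-sided z-test p-value.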

3. Combination of CDs through a monotonic function. In this section 
we consider a basic methodology for combining CDs which essentially orig- 
inates from combining p-values. However, there are some new twists, mod- 
ifications and extensions. Here one assumes that some past studies (with 
reasonably sensible results) on the current parameter of interest exist. The 
CDs to be combined may be based on different models. A nice feature of 
this combination method is that, after combination, the resulting function 
is always an exact CD if the input CDs from the individual studies are ex- 
act. Also, it does not require any information regarding how the input CDs 
were obtained. Section 3.1 considers the perfect situation when the common 
parameter had the same value in all previous studies on which the CDs are 
based. Section 3.2 presents an adaptive combination approach which works 
asymptotically, even when there exist some "wrong CDs" (CDs with underlying
true parameter values different from $\theta_0$). For clarity, the presentation
in this section is restricted to CDs only. The entire development holds for 
aCDs with little or no modification. 
3.1. CD combination and Bahadur efficiency. Let $H_1(y), \ldots, H_L(y)$ be
L independent CDs, with the same true parameter value $\theta_0$ (sample
sizes are suppressed in the CD notation in the rest of this paper). Suppose
$g_c(u_1, \ldots, u_L)$ is any continuous function from $[0,1]^L$ to
$\mathbb{R}$ that is monotonic in each coordinate. A general way of
combining, depending on $g_c(u_1, \ldots, u_L)$, can be described as
follows: Define $H_c(u_1, \ldots, u_L) = G_c(g_c(u_1, \ldots, u_L))$, where
$G_c(\cdot)$ is the continuous cumulative distribution function of
$g_c(U_1, \ldots, U_L)$, and $U_1, \ldots, U_L$ are independent U(0, 1)
distributed random variables. Denote

(3.1) $H_c(y) = H_c(H_1(y), \ldots, H_L(y))$.

It is easy to verify that $H_c(y)$ is a CD function for the parameter
$\theta$. We call $H_c(y)$ a combined CD. If the objective is only to get a
combined aCD, one may also allow the above $g_c$ function to involve sample
estimates.

Let $F_0(\cdot)$ be any continuous cumulative distribution function and
$F_0^{-1}(\cdot)$ be its inverse function. A convenient special case of the
function $g_c$ is

(3.2) $g_c(u_1, u_2, \ldots, u_L) = F_0^{-1}(u_1) + F_0^{-1}(u_2) + \cdots + F_0^{-1}(u_L)$.

In this case, $G_c(\cdot) = F_0 * \cdots * F_0(\cdot)$, where $*$ stands for
convolution. Just like the p-value combination approach, this general CD
combination recipe is simple and easy to implement. Some examples of $F_0$
are:

• $F_0(t) = \Phi(t)$ is the cumulative distribution function of the
standard normal. In this case

$H_{NM}(y) = \Phi\Bigl(\frac{1}{\sqrt{L}}\bigl[\Phi^{-1}(H_1(y)) + \Phi^{-1}(H_2(y)) + \cdots + \Phi^{-1}(H_L(y))\bigr]\Bigr)$.

• $F_0(t) = 1 - e^{-t}$, for $t > 0$, is the cumulative distribution
function of the standard exponential distribution (with mean 1). Or,
$F_0(t) = e^{t}$, for $t < 0$, which is the cumulative distribution function
of the mirror image of the standard exponential distribution. In these
cases the combined CDs are, respectively,

$H_{E1}(y) = P\Bigl(\chi^2_{2L} \le -2\sum_{i=1}^{L} \log(1 - H_i(y))\Bigr)$

and

$H_{E2}(y) = P\Bigl(\chi^2_{2L} \ge -2\sum_{i=1}^{L} \log H_i(y)\Bigr)$,

where $\chi^2_{2L}$ is a $\chi^2$-distributed random variable with 2L
degrees of freedom. The recipe for $H_{E2}(y)$ corresponds to Fisher's
recipe of combining p-values [Fisher (1932)].

• $F_0(t) = \frac12 e^{t} 1_{(t \le 0)} + (1 - \frac12 e^{-t}) 1_{(t > 0)}$,
denoted as DE(t) from now on, is the cumulative distribution function of the
standard double exponential distribution. Here $1_{(\cdot)}$ is the
indicator function. In this case the combined CD is

$H_{DE}(y) = DE_L\bigl(DE^{-1}(H_1(y)) + \cdots + DE^{-1}(H_L(y))\bigr)$,

where $DE_L(t) = DE * \cdots * DE(t)$ is the convolution of L copies of
DE(t).
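
The normal and exponential recipes can be written compactly as functions of the vector $(H_1(y), \ldots, H_L(y))$. The sketch below is an illustration, not the authors' implementation: the function names are hypothetical, inputs are assumed to lie strictly inside (0, 1), and the chi-square tail uses the closed-form sum that holds for even degrees of freedom, $P(\chi^2_{2L} > x) = e^{-x/2}\sum_{k=0}^{L-1} (x/2)^k/k!$.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def chi2_sf_even(x, dof):
    """P(chi2_dof > x) for even dof, via the closed-form finite sum."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def combine_normal(us):
    """H_NM: Phi of the scaled sum of normal scores."""
    return STD_NORMAL.cdf(sum(STD_NORMAL.inv_cdf(u) for u in us) / math.sqrt(len(us)))

def combine_e1(us):
    """H_E1: P(chi2_{2L} <= -2 * sum log(1 - u_i))."""
    stat = -2.0 * sum(math.log(1.0 - u) for u in us)
    return 1.0 - chi2_sf_even(stat, 2 * len(us))

def combine_e2(us):
    """H_E2 (Fisher's recipe): P(chi2_{2L} >= -2 * sum log u_i)."""
    stat = -2.0 * sum(math.log(u) for u in us)
    return chi2_sf_even(stat, 2 * len(us))

# Usage: combine two hypothetical CD values H_1(y) = 0.70 and H_2(y) = 0.85.
print(combine_normal([0.70, 0.85]), combine_e1([0.70, 0.85]), combine_e2([0.70, 0.85]))
```

Feeding independent U(0, 1) draws into any of these functions should return values that are again U(0, 1), which is the property that makes the output a CD.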

Lemma 3.1 next gives an iterative formula to compute $DE_L(t)$. One
critical fact of this lemma is that the exponential parts of the tails of
$DE_L(t)$ are the same as those of DE(t). The proof of Lemma 3.1 is in the
Appendix.

Lemma 3.1. For $t > 0$ we have

$1 - DE_L(t) = DE_L(-t) = \frac12 V_L(t) e^{-t}$,

where $V_L(t)$ is a polynomial of degree $L - 1$. This sequence of
polynomials satisfies the following recursive relation: for
k = 2, 3, ..., L,

$2V_k(t) = V_{k-1}(t) + \int_0^t [V_{k-1}(s) - V_{k-1}'(s)]\,ds + \int_0^{\infty} [V_{k-1}(s) + V_{k-1}(t+s) - V_{k-1}'(s)]\,e^{-2s}\,ds$.

In particular, $V_1(t) = 1$ and $V_2(t) = 1 + t/2$.
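
The recursion of Lemma 3.1 can be carried out exactly on polynomial coefficients. The sketch below is a hedged illustration: it assumes the recursion reads $2V_k(t) = V_{k-1}(t) + \int_0^t [V_{k-1}(s) - V_{k-1}'(s)]\,ds + \int_0^\infty [V_{k-1}(s) + V_{k-1}(t+s) - V_{k-1}'(s)]e^{-2s}\,ds$, stores each polynomial as a list of coefficients in increasing degree, and uses $\int_0^\infty s^m e^{-2s}\,ds = m!/2^{m+1}$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, exp, factorial, log

def _derivative(p):
    return [k * c for k, c in enumerate(p)][1:] or [Fraction(0)]

def _integral_from_zero(p):
    """Coefficients of the polynomial int_0^t p(s) ds."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(p)]

def _moment(m):
    """int_0^inf s^m e^{-2s} ds = m! / 2^(m+1)."""
    return Fraction(factorial(m), 2 ** (m + 1))

def _add(p, q, sign=1):
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + sign * b for a, b in zip(p, q)]

def next_v(v):
    """One step of the recursion: V_{k-1} -> V_k."""
    diff = _add(v, _derivative(v), sign=-1)          # V - V'
    first = _integral_from_zero(diff)                # int_0^t [V - V'] ds
    const = sum(c * _moment(m) for m, c in enumerate(diff))
    shifted = [Fraction(0)] * len(v)                 # int_0^inf V(t+s) e^{-2s} ds
    for m, c in enumerate(v):
        for j in range(m + 1):
            shifted[j] += c * comb(m, j) * _moment(m - j)
    total = _add(_add(v, first), shifted)
    total[0] += const
    return [c / 2 for c in total]

def v_polynomials(L):
    """[V_1, ..., V_L] as exact coefficient lists."""
    vs = [[Fraction(1)]]
    for _ in range(L - 1):
        vs.append(next_v(vs[-1]))
    return vs

def de_cdf(t, L):
    """DE_L(t), using 1 - DE_L(t) = DE_L(-t) = V_L(t) e^{-t} / 2 for t > 0."""
    v = v_polynomials(L)[-1]
    a = abs(t)
    tail = 0.5 * sum(float(c) * a ** k for k, c in enumerate(v)) * exp(-a)
    return 1.0 - tail if t > 0 else tail

def de_inverse(u):
    """Inverse of the standard double exponential cdf, for 0 < u < 1."""
    return log(2.0 * u) if u <= 0.5 else -log(2.0 * (1.0 - u))

def combine_de(us):
    """H_DE evaluated at the vector of input CD values."""
    return de_cdf(sum(de_inverse(u) for u in us), len(us))

print(v_polynomials(3)[-1])  # coefficients of V_3 in increasing degree
```

As a check on the reconstruction, the density of a sum of three standard double exponential variables is $(3 + 3|x| + x^2)e^{-|x|}/16$, whose upper tail gives $V_3(t) = 1 + 5t/8 + t^2/8$, in agreement with the output of the recursion.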

Littell and Folks (1973) established an optimality property, in terms of 
Bahadur slope, within the class of combined p-values based on monotonic 
combining functions. Along the same line, we establish below an optimality 
result for the combination of CDs. 

Following Littell and Folks (1973), we define the concept of Bahadur slope 
for a CD: 

Definition 3.1. Let n be the sample size corresponding to a CD function
$H(\cdot)$. We call a nonnegative function $S(t) = S(t; \theta_0)$ the
Bahadur slope for the CD function $H(\cdot)$ if for any $\varepsilon > 0$,
$S(-\varepsilon) = -\lim_{n \to +\infty} \frac{1}{n} \log H(\theta_0 - \varepsilon)$
and
$S(\varepsilon) = -\lim_{n \to +\infty} \frac{1}{n} \log\{1 - H(\theta_0 + \varepsilon)\}$
almost surely.

The Bahadur slope gives the rate, in exponential scale, at which
$H(\theta_0 - \varepsilon)$ and $1 - H(\theta_0 + \varepsilon)$ go to zero.
The larger the slope, the faster its tails decay to zero. In this sense, a
CD with a larger Bahadur slope is asymptotically more efficient as a
"distribution-estimator" for $\theta_0$.
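
As a numerical illustration of the definition, consider the z-based CD $H(y) = \Phi(\sqrt{n}(y - \bar X)/\sigma)$. Standard normal tail bounds give the slope $S(\varepsilon) = \varepsilon^2/(2\sigma^2)$ for this CD. The sketch below makes the simplifying assumption $\bar X = \theta_0$ (the almost sure limit) so that the finite-n quantity can be computed directly; the function name is hypothetical.

```python
import math

def empirical_slope(eps, n, sigma=1.0):
    """-(1/n) * log{1 - H(theta0 + eps)} for the z-based CD, with xbar set to theta0."""
    z = math.sqrt(n) * eps / sigma
    # 1 - Phi(z) = erfc(z / sqrt(2)) / 2; erfc keeps precision far in the tail.
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return -math.log(upper_tail) / n

limit = 0.5 ** 2 / 2.0  # eps^2 / (2 sigma^2) with eps = 0.5, sigma = 1
for n in (100, 400, 2000):
    print(n, empirical_slope(0.5, n), limit)
```

The finite-n values approach the limit from above, because the polynomial factor in the normal tail contributes a term of order $\log(n)/n$.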

Suppose $n_1, n_2, \ldots, n_L$, the sample sizes behind the CDs
$H_1(y), H_2(y), \ldots, H_L(y)$, go to infinity at the same rate. For
notational simplicity, replace $n_1$ by n and write
$n_j = \{\lambda_j + o(1)\}n$ for j = 1, 2, 3, ..., L; we always have
$\lambda_1 = 1$. Let $S_j(t)$ be the Bahadur slope for $H_j(y)$,
j = 1, 2, ..., L, and $S_c(t)$ be the Bahadur slope for their combined CD,
say $H_c(y)$. The next theorem provides an upper bound for $S_c(t)$ [i.e.,
the fastest possible decay rate of $H_c(y)$ tails] and indicates when it is
achieved. Its proof can be found in the Appendix.

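Theorem 3.2. Under $\theta = \theta_0$, for any $\varepsilon > 0$, as
$n \to +\infty$,

$-\liminf \frac{1}{n} \log H_c(\theta_0 - \varepsilon) \le \sum_{j=1}^{L} \lambda_j S_j(-\varepsilon)$

and

$-\liminf \frac{1}{n} \log(1 - H_c(\theta_0 + \varepsilon)) \le \sum_{j=1}^{L} \lambda_j S_j(\varepsilon)$

almost surely. If the slope function $S_c(\cdot)$ exists,

$S_c(-\varepsilon) \le \sum_{j=1}^{L} \lambda_j S_j(-\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$
and
$S_c(\varepsilon) \le \sum_{j=1}^{L} \lambda_j S_j(\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$.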




Fig. 1. A typical figure of density plots of $H_{DE}(\cdot)$,
$H_{E1}(\cdot)$ and $H_{E2}(\cdot)$, when we combine independent CDs for the
common mean parameter $\mu$ of $N(\mu, 1.0)$ and $N(\mu, 1.5)$; the true
$\mu = 1$ and the sample sizes are $n_1 = 30$ and $n_2 = 40$. The solid,
dotted and dashed curves are the density curves of $H_{DE}(\cdot)$,
$H_{E1}(\cdot)$ and $H_{E2}(\cdot)$, respectively.
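Furthermore, almost surely, for any $\varepsilon > 0$,

$S_{E1}(\varepsilon) = \sum_{j=1}^{L} \lambda_j S_j(\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$
and
$S_{E2}(-\varepsilon) = \sum_{j=1}^{L} \lambda_j S_j(-\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$;

$S_{DE}(-\varepsilon) = \sum_{j=1}^{L} \lambda_j S_j(-\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$
and
$S_{DE}(\varepsilon) = \sum_{j=1}^{L} \lambda_j S_j(\varepsilon) \Big/ \sum_{j=1}^{L} \lambda_j$.

Here, $S_{E1}(t)$, $S_{E2}(t)$ and $S_{DE}(t)$ are the Bahadur slope
functions of combined CDs $H_{E1}(x)$, $H_{E2}(x)$ and $H_{DE}(x)$,
respectively.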

This theorem states that the DE(t) based combining approach is, in fact,
optimal in terms of achieving the largest possible value of Bahadur slopes
on both sides. The two combined CDs $H_{E1}(y)$ and $H_{E2}(y)$ can achieve
the largest possible slope value only in one of the two regions,
$\theta > \theta_0$ or $\theta < \theta_0$. This phenomenon is illustrated
in Figure 1.

Note that, in the p-value combination case, Littell and Folks (1973) es- 
tablished that Fisher's way of combination is optimal in terms of having the 
largest possible Bahadur slope. To our knowledge, no one has considered the 
DE{t) based combination rule when combining p-values. There is a notable 
difference between combining p-values and CDs. While for CDs one cares
about the decay rates of both tails, separately, a typical p-value concept 
either involves only one tail of the distribution of a test statistic or lumps 
its two tails together. The DE(t) based combination rule is quite natural 
when combining CDs, but not when combining p-values. 

3.2. Adaptive combination. The development in Section 3.1 is under the
assumption that all the CDs $H_1(y), \ldots, H_L(y)$ are for the same
parameter $\theta$, with identical true value $\theta = \theta_0$. There may
be doubts about the validity of this assumption. For instance, let $H_1(y)$
be a CD for $\theta$ with true value $\theta_0$, based on a current study of
sample size $n_1$. Let $H_2(y), \ldots, H_L(y)$ be available CDs on
$\theta$ based on previous (independent) studies involving sample sizes
$\{n_2, \ldots, n_L\}$. One could be less than certain that all earlier
values of $\theta$ were indeed equal to the current value,
$\theta = \theta_0$. It will be problematic if one combines all the
available CDs when some of the studies had the underlying value
$\theta \ne \theta_0$. Indeed, the resulting function of combination will
not even be an aCD (under the true value $\theta = \theta_0$). This can be
demonstrated by a simple example of combining two CDs:
$H_1(y) = \Phi\bigl(\frac{y - \hat\theta_1}{\sigma/\sqrt{n}}\bigr)$ and
$H_2(y) = \Phi\bigl(\frac{y - \hat\theta_2}{\sigma/\sqrt{n}}\bigr)$,
where $\sigma$ is known, $\sqrt{n}(\hat\theta_1 - \theta_0) \sim N(0, \sigma^2)$,
and $\sqrt{n}(\hat\theta_2 - \tilde\theta_0) \sim N(0, \sigma^2)$, for some
$\tilde\theta_0 \ne \theta_0$. The combined outcome by the normal-based
approach,
$H_{NM}(y) = \Phi\bigl(\frac{2y - \hat\theta_1 - \hat\theta_2}{\sqrt{2}\,\sigma/\sqrt{n}}\bigr)$,
is not uniformly distributed, even in limit ($n \to \infty$) when
$y = \theta_0$. In this section we propose adaptive combination approaches
to remedy this problem and go on to establish related optimality properties
under a large sample setting.
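
The failure described in this example is easy to see in simulation. The sketch below evaluates the normal-based combination of two z-based CDs at $y = \theta_0$ and records how often it falls below 5%; the function names, shift size, sample size and seed are hypothetical choices.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def h_nm(y, est1, est2, sigma, n):
    """Normal-based combination of two z-based CDs with common sigma and n."""
    return STD_NORMAL.cdf((2.0 * y - est1 - est2) * n ** 0.5 / (2.0 ** 0.5 * sigma))

def fraction_below(level, theta0, theta0_other, sigma=1.0, n=100, reps=5000, seed=1):
    """Share of simulated studies with H_NM(theta0) <= level."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        est1 = rng.gauss(theta0, se)        # current study, true value theta0
        est2 = rng.gauss(theta0_other, se)  # earlier study, possibly a different value
        hits += h_nm(theta0, est1, est2, sigma, n) <= level
    return hits / reps

same_value = fraction_below(0.05, theta0=0.0, theta0_other=0.0)
different_value = fraction_below(0.05, theta0=0.0, theta0_other=1.0)
print(same_value, different_value)
```

When both studies share the true value the share is close to 0.05, as uniformity requires; when the second study is centered elsewhere, almost all of the mass piles up near 0.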

Let $H_1(\cdot)$ be a CD with true parameter value $\theta = \theta_0$,
based on a current study. The $L - 1$ CDs from the previous studies on
$\theta$ are separated into two sets:

$\mathcal{H}_0 = \{H_j : H_j$ has the underlying value of $\theta = \theta_0$, $j = 2, \ldots, L\}$,
$\mathcal{H}_1 = \{H_j : H_j$ has the underlying value of $\theta \ne \theta_0$, $j = 2, \ldots, L\}$.

The set $\mathcal{H}_0$ contains the "right" CDs and $\mathcal{H}_1$
contains the "wrong" CDs. We assume, however, the information about
$\mathcal{H}_0$ and $\mathcal{H}_1$ is unavailable to us.

The development of a general adaptive combination recipe starts with an
extension of the general combination method of Section 3.1, which includes
a set of weights $\omega = (\omega_1, \ldots, \omega_L)$, $\omega_1 = 1$.
Our intention is to select a set of adaptive weights that can filter out
the "wrong" CDs (in $\mathcal{H}_1$) and keep the "right" CDs (in
$\mathcal{H}_0$) asymptotically.

Although it could be much more general, for simplicity, we let

(3.3) $g_{c,\omega}(u_1, \ldots, u_L) = \sum_{j=1}^{L} \omega_j F_0^{-1}(u_j)$,

where $F_0(\cdot)$ is a continuous cumulative distribution function with a
bounded density. Let
$G_{c,\omega}(t) = F_0(t) * F_0(t/\omega_2) * \cdots * F_0(t/\omega_L)$ and
$H_{c,\omega}(u_1, \ldots, u_L) = G_{c,\omega}(g_{c,\omega}(u_1, \ldots, u_L))$.
We define

$H_{c,\omega}(y) = H_{c,\omega}(H_1(y), H_2(y), \ldots, H_L(y))$.

Define the weight vector $\omega_0$ as
$(1, \omega_2^{(0)}, \ldots, \omega_L^{(0)})$, where $\omega_j^{(0)} = 1$
for $H_j \in \mathcal{H}_0$, and $\omega_j^{(0)} = 0$ for
$H_j \in \mathcal{H}_1$. The combined CD function
$H_c^{(0)}(y) = H_{c,\omega_0}(y)$ is our target which combines $H_1$ with
the CDs in $\mathcal{H}_0$. Of course, we lack the knowledge of
$\mathcal{H}_0$ and $\mathcal{H}_1$, so $\omega_0$ is unknown. Thus, we need
to determine the adaptive weights, denoted by
$\omega^* = (1, \omega_2^*, \ldots, \omega_L^*)$, converging to $\omega_0$,
from the available information in the CDs. Let
$H^*(y) = H_{c,\omega^*}(y)$. One would hope that $H^*(y)$ is at least an
aCD.

Definition 3.2. A combination method is adaptively consistent if $H^*(y)$
is an aCD for $\theta = \theta_0$.

Suppose $n_1, n_2, \ldots, n_L$ go to infinity at the same rate. Again, we
let $n = n_1$ and write $n_j = \{\lambda_j + o(1)\}n$, for
j = 1, 2, ..., L; $\lambda_1 = 1$. We define below adaptive slope
efficiency.
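Definition 3.3. A combination method is adaptively slope efficient if
for any $\varepsilon > 0$,

$-\lim_{n \to +\infty} \frac{1}{n} \log H^*(\theta_0 - \varepsilon) = \sum_{j:\, H_j \in \{H_1\} \cup \mathcal{H}_0} \lambda_j S_j(-\varepsilon)$,

$-\lim_{n \to +\infty} \frac{1}{n} \log\{1 - H^*(\theta_0 + \varepsilon)\} = \sum_{j:\, H_j \in \{H_1\} \cup \mathcal{H}_0} \lambda_j S_j(\varepsilon)$.

Here $S_j(t)$ is the Bahadur slope of $H_j(x)$, all assumed to exist.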

Let $I_i$ be a confidence interval derived from $H_i(\cdot)$, for
i = 1, 2, ..., L. Suppose, as $n \to \infty$, the lengths of these L
confidence intervals all go to zero almost surely. Then, for all large n,
it is expected that $I_1 \cap I_j = \emptyset$ for j such that
$H_j \in \mathcal{H}_1$. This suggests that in order to get rid of the CDs
in $\mathcal{H}_1$ when n is large, we should take
$\omega_j^* = 1_{(I_1 \cap I_j \ne \emptyset)}$ for j = 2, ..., L, where
$1_{(\cdot)}$ is the indicator function. With this choice of data-dependent
weights, we have the following theorem. The theorem can be easily proved
using the Borel-Cantelli lemma and its proof is omitted.

Theorem 3.3. Assume that the intervals $I_j$ lie within an
$\varepsilon$-neighborhood of the corresponding true value of $\theta$, for
all large n almost surely, and for any fixed $\varepsilon > 0$. In addition,
assume that, for each j such that $H_j \in \mathcal{H}_0$, we have

(3.4) $\sum_{n=1}^{+\infty} P(I_1 \cap I_j = \emptyset) < +\infty$.

Then if $\omega_j^* = 1_{(I_1 \cap I_j \ne \emptyset)}$ for
j = 1, 2, ..., L, we have $\sup_y |H^*(y) - H_c^{(0)}(y)| = 0$, for all
large n almost surely.

Note that $H_c^{(0)}(y)$ is the "target" combined CD. From Theorem 3.3 we
immediately have the following corollary.

Corollary 3.4. Under the assumptions of Theorem 3.3, the adaptive
combination recipe described in this section, with the adapting weights
$\omega_j^* = 1_{(I_1 \cap I_j \ne \emptyset)}$ for j = 2, ..., L, is
adaptively consistent. Furthermore, if $F_0(t) = DE(t)$ in (3.3), this
combination method is also adaptively slope efficient.



Remark 3.1. A simple example is to take
$I_i = (H_i^{-1}(\alpha_n/2), H_i^{-1}(1 - \alpha_n/2))$, i = 1, 2, ..., L.
It follows that
$P(I_1 \cap I_j = \emptyset) \le 1 - (1 - \alpha_n)^2 \le 2\alpha_n$ for
each j such that $H_j \in \mathcal{H}_0$. Thus,
$\sum_{n=1}^{\infty} \alpha_n < \infty$ is a sufficient condition for (3.4).
However, this bound is typically very conservative. To see this, consider
the basic example of z-based CD for unknown normal mean with known variance
$\sigma^2$. Let $H_i(y) = \Phi(\sqrt{n}(y - \bar X_i)/\sigma)$, where
$\bar X_i$ is the sample mean in the ith study, and $z_{\alpha_n/2}$ is the
normal critical value of level $\alpha_n/2$. We have
$P(I_1 \cap I_j = \emptyset) = 2(1 - \Phi(\sqrt{2}\, z_{\alpha_n/2}))$,
which could be a lot smaller than $2\alpha_n$. Considering this issue, we
recommend a somewhat substantial value for $\alpha_n$ in applications.
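
The interval-overlap weights can be sketched as follows for z-based CDs. Two choices here are assumptions made for brevity, not the paper's prescription: the weighted combination uses $F_0 = \Phi$ in (3.3), for which $\sum_j \omega_j \Phi^{-1}(U_j)$ is $N(0, \sum_j \omega_j^2)$, whereas the slope-efficiency statement in Corollary 3.4 concerns $F_0 = DE$; and the study summaries and $\alpha_n = 0.01$ are made up.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_interval(xbar, sigma, n, alpha):
    """(H^{-1}(alpha/2), H^{-1}(1 - alpha/2)) for the z-based CD."""
    half = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def overlap_weights(intervals):
    """omega_j = 1 if I_1 and I_j intersect, else 0; intervals[0] is the current study."""
    lo1, hi1 = intervals[0]
    return [1 if max(lo1, lo) < min(hi1, hi) else 0 for lo, hi in intervals]

def weighted_normal_combination(y, studies, weights):
    """G(sum_j w_j * Phi^{-1}(H_j(y))) with G the cdf of N(0, sum_j w_j^2)."""
    score = sum(w * math.sqrt(n) * (y - xbar) / sigma
                for w, (xbar, sigma, n) in zip(weights, studies))
    return STD_NORMAL.cdf(score / math.sqrt(sum(w * w for w in weights)))

# Hypothetical studies as (sample mean, sigma, n); the third is centered elsewhere.
studies = [(1.02, 1.0, 100), (0.95, 1.0, 120), (2.10, 1.0, 80)]
intervals = [z_interval(xbar, sigma, n, alpha=0.01) for xbar, sigma, n in studies]
weights = overlap_weights(intervals)
print(weights, weighted_normal_combination(1.0, studies, weights))
```

With these numbers the third interval does not meet the first, so the third study receives weight 0 and the result equals the normal-based combination of the first two CDs.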

A feature that may be regarded as undesirable with the above adaptive
method is the fact that it assigns weights either 0 or 1 to the CDs. We
propose below the use of kernel function based weights, which take values
between 0 and 1. Under some regularity conditions, we will show that the
weighted adaptive combination method with the kernel based weights is
adaptively consistent and "locally efficient" (Theorem 3.5).

Let K(t) be a symmetric kernel function, $\int K(t)\,dt = 1$,
$\int tK(t)\,dt = 0$ and $\int t^2 K(t)\,dt = 1$. In the present context we
also require that the tails of the kernel function tend to zero at an
exponential rate. Some examples are the normal kernel $K(t) = \phi(t)$, the
triangle kernel and the rectangular kernel function, among others.

In order to use the kernel function, some measure of "distance" between
$H_1(y)$ and $H_j(y)$, j = 2, ..., L, is needed. For illustrative purposes,
we use $\hat\theta_1 - \hat\theta_j$, where $\hat\theta_i$, i = 1, ..., L,
are point estimators obtained from $H_i(y)$, respectively. We assume
$\hat\theta_i$, for i = 1, 2, ..., L, converge in probability to their
respective underlying values of $\theta$, say $\theta_{0,i}$, at the same
polynomial rate. For i = 1 or i such that $H_i \in \mathcal{H}_0$,
$\theta_{0,i} = \theta_0$. Let $b_n \to 0$ be such that
$|\hat\theta_i - \theta_{0,i}| = O_p(b_n)$. We define the kernel function
based weights as

(3.5) $\omega_j^* = K\Bigl(\frac{\hat\theta_1 - \hat\theta_j}{b_n}\Bigr) \Big/ K(0)$ for j = 1, 2, ..., L.

Among many other possibilities, one set of convenient choices is
$\hat\theta_i = H_i^{-1}(\frac12)$ and $b_n = \sqrt{R_n}$, where
$R_n = H_1^{-1}(\frac34) - H_1^{-1}(\frac14)$ is the interquartile range of
$H_1(y)$.
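
A small sketch of the kernel weights in (3.5) with the normal kernel and the convenient choices just mentioned (CD medians as point estimators, $b_n$ equal to the square root of the interquartile range of $H_1$). The z-based CDs, the helper names and the study summaries are assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def cd_quantile(p, xbar, sigma, n):
    """H^{-1}(p) for the z-based CD H(y) = Phi(sqrt(n) * (y - xbar) / sigma)."""
    return xbar + STD_NORMAL.inv_cdf(p) * sigma / math.sqrt(n)

def kernel_weights(studies):
    """omega_j = K((theta_1 - theta_j) / b_n) / K(0) with K the standard normal density.

    studies: list of (xbar, sigma, n); studies[0] is the current study.
    """
    medians = [cd_quantile(0.5, *s) for s in studies]
    iqr = cd_quantile(0.75, *studies[0]) - cd_quantile(0.25, *studies[0])
    b_n = math.sqrt(iqr)
    k0 = STD_NORMAL.pdf(0.0)
    return [STD_NORMAL.pdf((medians[0] - m) / b_n) / k0 for m in medians]

# Hypothetical large studies; the third has a different underlying value.
studies = [(1.002, 1.0, 10000), (0.995, 1.0, 10000), (2.100, 1.0, 10000)]
weights = kernel_weights(studies)
print([round(w, 4) for w in weights])
```

The current study always receives weight 1, a study with a nearby estimate receives a weight close to 1, and a study whose estimate is many bandwidths away receives a weight that is numerically negligible.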

Under the above setting, we have the following theorem; its proof is in 
the Appendix. 

Theorem 3.5. Let $\delta_n > 0$ be a sequence such that
$H_i(\theta_0 \pm \delta_n)$ are bounded away from 0 and 1, in probability,
for i = 1 and i with $H_i \in \mathcal{H}_0$. Suppose $F_0$ in (3.3) is
selected such that $\min\{F_0(t), 1 - F_0(t)\}$ tends to zero exponentially
fast as $|t| \to \infty$. Then, with $\omega^*$ as defined in (3.5), one
has

$\sup_{y \in [\theta_0 - \delta_n,\, \theta_0 + \delta_n]} |H^*(y) - H_c^{(0)}(y)| \to 0$ in probability, as $n \to +\infty$.
Theorem 3.5 suggests that in a local neighborhood of $\theta_0$, $H^*(y)$
and $H_c^{(0)}(y)$ are close for large n. Recall that $H_c^{(0)}(y)$ is the
target that combines $H_1$ with the CDs in $\mathcal{H}_0$. The following
conclusion is immediate from Theorem 3.5.

Corollary 3.6. Under the setting of Theorem 3.5, with $\omega^*$ as in
(3.5), the adaptive combination method described in this section is
adaptively consistent.

The result in Theorem 3.5 is a local result depending on $\delta_n$, which
is typically $O(n^{-1/2})$. For a set of general kernel weights of the form
(3.5), we cannot get an anticipated adaptive slope efficiency result for the
adaptive DE combination method. But, for the rectangular kernel, this
optimality result does hold, since in this case the weight $\omega_j^*$
becomes either 1 or 0 for all large n, almost surely. The proof of the
following corollary is similar to that of Theorem 3.4 and is omitted.

Corollary 3.7. Under the setting of Theorem 3.5, with $\omega^*$ as in
(3.5) and $K(t) = \{1/(2\sqrt{3})\} 1_{(|t| < \sqrt{3})}$, the adaptive
combination method described in this section with $F_0(t) = DE(t)$ is
adaptively slope efficient if
$\sum_{n=1}^{\infty} P(|\hat\theta_j - \theta_{0,j}| > \frac{\sqrt{3}}{2} b_n) < \infty$
for j = 1, 2, ..., L.

4. Combination of CDs through multiplying CD densities. Normalized 
likelihood functions (as a function of the parameter) are an important source 
of obtaining CD or aCD densities. In fact, it was Fisher who prescribed the 
use of normalized likelihood functions for obtaining his fiducial distributions; 
see, for example, Fisher (1973). Multiplying likelihood functions from inde- 
pendent sources constitutes a standard method for combining parametric 
information. Naturally, this suggests multiplying CD densities and normal- 
izing to possibly derive combined CDs as follows: 

(4.1) $H_P(\theta) = \int_{(-\infty, \theta] \cap \Theta} h^*(y)\,dy \Big/ \int_{\Theta} h^*(y)\,dy$,

where $h^*(y) = \prod_{i=1}^{L} h_i(y)$ and $h_i(y)$ are CD densities from
L independent studies. Schweder and Hjort (2002) suggested multiplying
their reduced likelihood functions for combined estimation, which is closely
related to the approach of (4.1). However, they did not require
normalization and, strictly speaking, the reduced likelihood function, in
general, is not a CD density for $\theta$ (it is only proportional to a CD
density for a specially transformed parameter).

Unfortunately, the combined function $H_P(\cdot)$ may not necessarily be a
CD or even an aCD function in general. But we do have some quite general
affirmative results. We first present here a basic result pertaining to
$H_P(\cdot)$. Let $T_1, T_2, \ldots, T_L$ be a set of statistics from L
independent samples. Suppose $\tilde H_i(\cdot)$, i = 1, 2, ..., L, are the
cumulative distribution functions of $T_i - \theta$ with density functions
$\tilde h_i(\cdot)$ which are entirely free of parameters. Thus, one has L
CDs of $\theta$, given by $H_i(\theta) = 1 - \tilde H_i(T_i - \theta)$ with
corresponding CD densities $h_i(\theta) = \tilde h_i(T_i - \theta)$.

Theorem 4.1. In the above setting, $H_P(\theta)$ is an exact CD of $\theta$.
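
For a concrete instance of this setting, take $T_i = \bar X_i$ from normal samples with known standard errors, so that each CD density is a normal density in $\theta$. The sketch below evaluates (4.1) by trapezoidal integration and compares it with the closed form implied by the standard fact that a product of normal densities is proportional to a normal density with precision-weighted mean. The function names, integration range and study summaries are hypothetical.

```python
import math
from statistics import NormalDist

def product_density(y, studies):
    """h*(y): product of the CD densities N(xbar_i, se_i^2) evaluated at y."""
    value = 1.0
    for xbar, se in studies:
        value *= NormalDist(xbar, se).pdf(y)
    return value

def trapezoid(f, lo, hi, steps=4000):
    h = (hi - lo) / steps
    inner = sum(f(lo + k * h) for k in range(1, steps))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))

def product_cd(theta, studies, lo, hi):
    """H_P(theta) from (4.1), with the real line truncated to [lo, hi]."""
    f = lambda y: product_density(y, studies)
    return trapezoid(f, lo, theta) / trapezoid(f, lo, hi)

def precision_weighted_cd(theta, studies):
    """Closed form: normal cdf with precision-weighted mean and pooled precision."""
    precisions = [1.0 / se ** 2 for _, se in studies]
    mean = sum(w * xbar for w, (xbar, _) in zip(precisions, studies)) / sum(precisions)
    return NormalDist(mean, 1.0 / math.sqrt(sum(precisions))).cdf(theta)

# Hypothetical studies as (sample mean, standard error).
studies = [(1.00, 1.0 / math.sqrt(30)), (1.20, 1.5 / math.sqrt(40))]
for theta in (0.9, 1.05, 1.3):
    print(theta, product_cd(theta, studies, -2.0, 4.0), precision_weighted_cd(theta, studies))
```

The two columns agree up to numerical integration error, which illustrates that in this location setting multiplying and normalizing CD densities reproduces the usual inverse-variance pooling.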

An elementary proof of this theorem is given in the Appendix. This the- 
orem can also be proved using the general theory relating best equivariant 
procedures and Bayes procedures relative to right invariant Haar measure 
[see, e.g., Berger (1985), Chapter 6 for the Bayes invariance theory]. Using 
this Bayes-equivariance connection, or directly, one can also obtain an
exact CD for the scale parameter $\theta$, but it requires one to replace
$h^*(y)$ in (4.1) with $h^{**}(y)/y$, where
$h^{**}(y) = \prod_{i=1}^{L} \{y\,h_i(y)\}$.

The original method of (4.1) does not yield an exact CD for a scale
parameter. Let us consider a simple example.

Example 4.1. Consider the $U[0, \theta]$ distribution with unknown $\theta$.
Let $H_i(\theta) = 1 - (Y_i/\theta)^{n_i}$ over $\theta \ge Y_i$, $i = 1, 2$,
be the input CDs, where $Y_1$ and $Y_2$ are maxima of two independent samples
of sizes $n_1$ and $n_2$. The multiplication method (4.1) yields
$H_P(\theta) = 1 - (Y/\theta)^{n_1 + n_2 + 1}$, over
$\theta \ge Y = \max(Y_1, Y_2)$. This $H_P(\theta)$ is not an exact CD, though
it is an aCD.
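
The gap between $H_P$ and an exact CD in Example 4.1 can be checked
numerically. The following Python sketch is only an illustration: the function
names, the sample sizes and the seed are our own assumptions and are not part
of the example. It uses the fact that, at the true value $\theta_0$,
$Y/\theta_0$ has distribution function $t^{n_1+n_2}$ on $[0,1]$, so that
$P\{H_P(\theta_0) \le s\} = 1 - (1-s)^{(n_1+n_2)/(n_1+n_2+1)}$, which differs
from $s$ for finite samples but approaches $s$ as $n_1 + n_2$ grows.

```python
import random


def hp_cdf_at_truth(s, n1, n2):
    """Closed-form P{H_P(theta0) <= s} for H_P(theta) = 1 - (Y/theta)^(n1+n2+1)."""
    m = n1 + n2
    return 1.0 - (1.0 - s) ** (m / (m + 1.0))


def simulate_hp(n1, n2, theta0=1.0, reps=20000, seed=0):
    """Monte Carlo draws of H_P(theta0) from two independent U[0, theta0] samples."""
    rng = random.Random(seed)
    m = n1 + n2
    draws = []
    for _ in range(reps):
        # Y = max(Y1, Y2) is the maximum of all n1 + n2 observations.
        y = max(rng.uniform(0.0, theta0) for _ in range(m))
        draws.append(1.0 - (y / theta0) ** (m + 1))
    return draws


# Hypothetical small-sample setting chosen for illustration.
n1, n2, s = 2, 3, 0.5
exact_prob = hp_cdf_at_truth(s, n1, n2)
draws = simulate_hp(n1, n2)
mc_prob = sum(d <= s for d in draws) / len(draws)
print(f"P(H_P(theta0) <= {s}): closed form {exact_prob:.4f}, simulated {mc_prob:.4f}")
```

An exact CD would give probability $s$ here; the closed form shows the
discrepancy shrinking as the exponent ratio tends to 1, in line with the aCD
property.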

The setting for Theorem 4.1 is limited. But it allows an asymptotic ex- 
tension that covers a wide range of problems, including those involving the 
normal and "heavy tailed" asymptotics, as well as other nonstandard asymp- 
totics such as that in Example 4.1. 

Let $\tilde H_{i,a}$ be an asymptotic (weak limit) cumulative distribution
function of $\xi_i = n_i^{\alpha} \frac{T_i - \theta}{V_i}$, where
$(T_i, V_i)$ are statistics based on independent samples of sizes $n_i$,
$i = 1, \ldots, L$. Denote $\tilde h_{i,a}(\cdot) = \tilde H'_{i,a}(\cdot)$.
One has aCD densities given by

(4.2)  $h_{i,a}(\theta) = \frac{n_i^{\alpha}}{V_i}\,
       \tilde h_{i,a}\!\left(n_i^{\alpha} \frac{T_i - \theta}{V_i}\right).$

Let $\xi_i$ have uniformly bounded exact densities $\tilde h_{i,e}(\cdot)$,
for $i = 1, \ldots, L$, and define $h_{i,e}(\cdot)$ as in (4.2) with
$\tilde h_{i,a}$ replaced by $\tilde h_{i,e}(\cdot)$. Assume the regularity
conditions: (a) $\tilde h_{i,e}(\cdot) \to \tilde h_{i,a}(\cdot)$ uniformly on
compact sets, (b) $\tilde h_{i,e}(\cdot)$ are uniformly integrable,
(c) $V_i \to \tau_i^2$, a positive quantity, in probability, for
$i = 1, \ldots, L$. Define $H_P(\cdot)$ by (4.1) where $h_i(\cdot)$ is either
$h_{i,a}(\cdot)$ or $h_{i,e}(\cdot)$.

Theorem 4.2. In the above setting, $H_P(\cdot)$ is an aCD.

The proof is based on standard asymptotics using Theorem 4.1 on combination
of the $H_{i,a}$'s and is not presented here. We would like to remark that,
due to the special form of the normal density, in the usual case of normal
asymptotics ($\alpha = \frac{1}{2}$, $\tilde H_{i,a} = \Phi$), the combined
function $H_P(\cdot)$, with $h_i(\cdot) = h_{i,a}(\cdot)$ in (4.1), is an aCD
without requiring the regularity conditions (a) and (b).

For the purpose of comparing the two different combination approaches
given by (3.1) and (4.1), we now specialize to asymptotic normality based
aCDs where both methods can apply. Let
$\xi_i = \sqrt{n_i}\,\frac{T_i - \theta}{\sqrt{V_i}}$. The normality based aCD
is $H_{i,a}(\theta) = 1 - \tilde H_{i,a}(\xi_i) = 1 - \Phi(\xi_i)$ with aCD
density $h_{i,a}(\theta) = \frac{\sqrt{n_i}}{\sqrt{V_i}}\,\tilde h_{i,a}(\xi_i)
= \frac{\sqrt{n_i}}{\sqrt{V_i}}\,\phi(\xi_i)$. Consider the combined function
$H_P(\cdot)$ with input aCD densities $h_{i,a}(\cdot)$ or $h_{i,e}(\cdot)$. It
is straightforward to verify that $H_P(\cdot)$ in this special case is the
same as (or asymptotically equivalent to)

$$H_{AN}(\theta) = 1 - \Phi\!\left(\left[\sum_{i=1}^{L} \frac{n_i}{V_i}\right]^{1/2}
  (\hat\theta_c - \theta)\right),$$

where $\hat\theta_c = \left(\sum_{i=1}^{L} \frac{n_i}{V_i} T_i\right) \Big/
\sum_{i=1}^{L} \left(\frac{n_i}{V_i}\right)$ is the asymptotically optimal
linear combination of $T_i$, $i = 1, 2, \ldots, L$. In light of this remark,
it is evident that the large sample comparison presented below between
$H_{AN}$ and $H_{DE}$ also holds between $H_P$ and $H_{DE}$. Note that
$H_{AN}$ is, in fact, a member of the rich class of combining methods
introduced in Section 3.1, where we pick
$g_c(u_1, \ldots, u_L) = \sum_{i=1}^{L} \left[\left(\frac{n_i}{V_i}\right)^{1/2}
\Phi^{-1}(u_i)\right]$ in (3.1).
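
The equivalence between multiplying normal aCD densities and the closed form
$H_{AN}$ can be verified numerically. The sketch below is illustrative only:
the values of $T_i$, $V_i$ and $n_i$ are hypothetical, and the integration
range and midpoint rule are our own choices. It normalizes the product of the
densities $h_{i,a}$ as in (4.1) and compares the result with
$1 - \Phi([\sum_i n_i/V_i]^{1/2}(\hat\theta_c - \theta))$.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def h_an(theta, T, V, n):
    """Closed form: 1 - Phi([sum n_i/V_i]^(1/2) (theta_c - theta))."""
    w = [ni / vi for ni, vi in zip(n, V)]
    theta_c = sum(wi * ti for wi, ti in zip(w, T)) / sum(w)
    return 1.0 - Phi(math.sqrt(sum(w)) * (theta_c - theta))


def h_p(theta, T, V, n, lo=-10.0, hi=10.0, steps=20000):
    """Normalized integral of the product of normal aCD densities (midpoint rule)."""
    def h_star(y):
        prod = 1.0
        for ti, vi, ni in zip(T, V, n):
            scale = math.sqrt(ni / vi)
            prod *= scale * phi(scale * (ti - y))
        return prod

    dx = (hi - lo) / steps
    grid = [lo + (k + 0.5) * dx for k in range(steps)]
    total = sum(h_star(y) for y in grid) * dx
    part = sum(h_star(y) for y in grid if y < theta) * dx
    return part / total


# Hypothetical inputs: three studies with estimates T, variances V, sizes n.
T, V, n = [0.2, 0.5, -0.1], [1.0, 2.0, 0.5], [20, 30, 25]
print(h_p(0.1, T, V, n), h_an(0.1, T, V, n))
```

The two printed values should agree up to the discretization error of the
numerical integral.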

The concept of the Bahadur slope, which is at the heart of Section 3, is still
well defined for aCDs and $H_{DE}(\cdot)$ still has the slope optimality.
However, the concept of slope loses its appeal on aCDs since one can alter the
slope of an aCD by tampering with its tails, while keeping it an aCD.
Nevertheless, if the input CDs are the normal based aCDs mentioned above, it
is straightforward to show that $H_{DE}$ and $H_{AN}$ have the same slope
(achieving the upper bound). This result is noteworthy in the special case
when the $T_i$'s are means of normal samples and $V_i = \sigma_i^2$ are the
known variances, where $H_{AN}$ is an exact CD. In this case, $H_{AN}$ is
derived from a UMVUE.

Next we address the following question: How do $H_{DE}$ and $H_{AN}$ compare
in terms of the lengths of their derived equal-tail, two-sided asymptotic
confidence intervals? Let us define
$\ell_{DE}(\alpha) = H_{DE}^{-1}(1 - \alpha) - H_{DE}^{-1}(\alpha)$ and
$\ell_{AN}(\alpha) = H_{AN}^{-1}(1 - \alpha) - H_{AN}^{-1}(\alpha)$. Also, we
assume, as in Section 3, that $n_1 = n$ and $n_j/n$ are bounded below and
above, for $j = 2, \ldots, L$. Let $\lim^*$ be the limit as
$(n, \alpha) \to (\infty, 0)$. A key aspect of $\lim^*$ is the fact that it
allows $\alpha$ to converge to 0 at an arbitrary rate; of course, slow rates
are better from a practical viewpoint. The proof of the following theorem is
given in the Appendix.

Theorem 4.3. $\lim^* [\ell_{DE}(\alpha)/\ell_{AN}(\alpha)] = 1$ in probability.

Hence, for combining large sample normality based aCDs, the three combining
methods $H_{DE}$, $H_P$ and $H_{AN}$ are equivalent in the above sense. When
the input aCDs are derived from profile likelihood functions, the
$H_P$-method amounts to multiplying profile likelihood functions, in which
case (in view of the standard likelihood inference) $H_P$ may have a minor
edge over $H_{DE}$ when a finer comparison is employed. On the other hand,
$H_{DE}$ has its global appeal, especially when nothing is known about how
the input CDs were derived. It always returns an exact CD when the input CDs
are exact. Also, $H_{DE}$ preserves the second-order accuracy when the input
CDs are second-order accurate (a somewhat nontrivial result, not presented
here). Aspects of second-order asymptotics on $H_P$ are not known to us,
while $H_{AN}$ ignores second-order corrections.

The adaptive combining in Section 3.2 carries over to $H_{AN}$, since
$H_{AN}(\cdot)$ is a member of the rich class of combining methods introduced
there. Also, one can turn $H_P$ into an adaptively combined CD by replacing
$h^*(y)$ in (4.1) with $h^*_{\omega}(y)$, where
$h^*_{\omega}(y) = \prod_{i=1}^{L} h_i^{\omega_i}(y)$ or
$\prod_{i=1}^{L} h_i(\omega_i y)$. The adaptive weights $\omega_i$ are chosen
such that $\omega_i \to 1$ for the "right" CDs (in $\mathcal{H}_0$) and
$\omega_i \to 0$ for the "wrong" CDs (in $\mathcal{H}_1$). Some results along
the line of Section 3.2 can be derived.

We close this section with the emerging recommendation that while normal
type aCDs can be combined by any of the methods $H_{DE}$, $H_P$ or $H_{AN}$,
exact CDs and higher-order accurate CDs should generally be combined by
the DE method.

5. Examples. 

5.1. The common mean problem. The so-called common mean problem
of making inference on the common mean, say $\mu$, of two or more normal
populations of possibly different variances, also known as the
Behrens-Fisher problem, has attracted a lot of attention in the literature.
In the large sample setting, it is well known that the Graybill-Deal
estimator,
$\hat\mu_{GD} = \{(n_1/s_1^2)\bar X_1 + (n_2/s_2^2)\bar X_2\} /
\{(n_1/s_1^2) + (n_2/s_2^2)\}$, is asymptotically efficient. In the small
sample setting, there is still research going on attempting to find efficient
exact confidence intervals for $\mu$. In particular, Jordan and
Krishnamoorthy [(1996), through combining statistics] and Yu, Sun and
Sinha [(1999), through combining two-sided p-values] proposed efficient
exact confidence intervals for the common mean $\mu$; however, there is a
small but nonzero chance that these intervals do not exist.

Let us consider the CD based method, first under large sample settings.
In this case, we start with normal based aCDs
$H_{1a}(y) = \Phi\!\left(\frac{y - \bar X_1}{s_1/\sqrt{n_1}}\right)$ and
$H_{2a}(y) = \Phi\!\left(\frac{y - \bar X_2}{s_2/\sqrt{n_2}}\right)$.
Following Section 4.2, we would like to prescribe the combined CD
$H_{AN}(y)$ [or $H_P(y)$, which is the same]. It is interesting to
note that this combined CD is the same as the CD directly derived from
the Graybill-Deal estimator. Thus, the confidence intervals derived from
$H_{AN}(\cdot)$ are asymptotically shortest.
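
As a concrete illustration of the large sample recipe above, the following
Python sketch computes the Graybill-Deal estimator, the combined aCD and the
equal-tail interval derived from it. The data values and the function names
are hypothetical and chosen only for illustration; the sketch assumes the
combined aCD takes the form
$\Phi([n_1/s_1^2 + n_2/s_2^2]^{1/2}(y - \hat\mu_{GD}))$.

```python
import math
from statistics import NormalDist, mean, variance


def graybill_deal(x1, x2):
    """Graybill-Deal estimate and its asymptotic standard error."""
    w1 = len(x1) / variance(x1)  # n1 / s1^2 (sample variance, divisor n - 1)
    w2 = len(x2) / variance(x2)  # n2 / s2^2
    mu_gd = (w1 * mean(x1) + w2 * mean(x2)) / (w1 + w2)
    return mu_gd, 1.0 / math.sqrt(w1 + w2)


def h_an(y, x1, x2):
    """Combined normal aCD evaluated at y."""
    mu_gd, se = graybill_deal(x1, x2)
    return NormalDist().cdf((y - mu_gd) / se)


def an_interval(x1, x2, level=0.95):
    """Equal-tail interval (q_{alpha/2}, q_{1-alpha/2}) of the combined aCD."""
    mu_gd, se = graybill_deal(x1, x2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mu_gd - z * se, mu_gd + z * se


# Hypothetical samples, used only to exercise the functions.
x1 = [4.9, 5.3, 5.1, 4.7, 5.0]
x2 = [5.6, 4.2, 5.9, 4.4, 5.2, 4.7]
lo, hi = an_interval(x1, x2)
print(graybill_deal(x1, x2)[0], (lo, hi))
```

The interval endpoints are the $\alpha/2$ and $1-\alpha/2$ quantiles of the
combined aCD, so evaluating `h_an` at them returns 0.025 and 0.975.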

If one wants to obtain exact confidence intervals for $\mu$, one can turn
to the recipe prescribed in Section 3.1. Clearly, exact CDs for $\mu$ based on
two independent normal samples are
$H_1(y) = F_{t_{n_1-1}}\!\left(\frac{y - \bar X_1}{s_1/\sqrt{n_1}}\right)$ and
$H_2(y) = F_{t_{n_2-1}}\!\left(\frac{y - \bar X_2}{s_2/\sqrt{n_2}}\right)$,
respectively; see Example 2.1. By Theorem 3.2, the DE based approach will be
Bahadur optimal among all exact CD based approaches of Section 3. The
resulting exact confidence interval for $\mu$, with coverage $1 - \alpha$,
is $(q_{\alpha/2}, q_{1-\alpha/2})$, where $q_s$ is the $s$-quantile of the CD
function $H_{DE}(y)$. This exact confidence interval for $\mu$ always exists
at every level $\alpha$.

We carried out a simulation study of 1000 replications to examine the
coverage of the CD based approaches, under three sets of sample sizes
$(n_1, n_2) = (3, 4)$, $(30, 40)$ or $(100, 140)$ and two sets of (true)
variances $(\sigma_1^2, \sigma_2^2) = (1, 1.5^2)$ or $(1, 3.5^2)$. The
coverage of constructed 95% confidence intervals is right on target around
95% in the six cases for the $H_{DE}$ based exact method. However, the
Graybill-Deal (i.e., aCD $H_{AN}$ or $H_P$) based method leads to serious
under-coverage (84.8% and 85.9%) in the two cases with small sample sizes
$(n_1, n_2) = (3, 4)$, and notable under-coverage (93.3% and 93.6%) in
the two cases with moderate sample sizes $(n_1, n_2) = (30, 40)$. So, in
small sample cases, the exact CD based approach is substantially better, in
terms of coverage.
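
The under-coverage of the Graybill-Deal based interval in small samples is
easy to reproduce in outline. The Python sketch below is not the simulation
code used for the study above; the seed, the number of replications and the
function names are our own assumptions, and only the aCD ($H_{AN}$) interval
is simulated.

```python
import math
import random
from statistics import NormalDist, mean, variance


def gd_interval(x1, x2, level=0.95):
    """Equal-tail interval from the normal aCD centered at the Graybill-Deal estimate."""
    w1 = len(x1) / variance(x1)
    w2 = len(x2) / variance(x2)
    center = (w1 * mean(x1) + w2 * mean(x2)) / (w1 + w2)
    half = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(w1 + w2)
    return center - half, center + half


def coverage(n1, n2, sd1, sd2, mu=0.0, reps=2000, seed=1):
    """Fraction of simulated intervals that contain the common mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1 = [rng.gauss(mu, sd1) for _ in range(n1)]
        x2 = [rng.gauss(mu, sd2) for _ in range(n2)]
        lo, hi = gd_interval(x1, x2)
        hits += lo <= mu <= hi
    return hits / reps


small = coverage(3, 4, 1.0, 1.5)      # sample sizes (3, 4)
large = coverage(100, 140, 1.0, 1.5)  # sample sizes (100, 140)
print(f"coverage at (3,4): {small:.3f}; at (100,140): {large:.3f}")
```

With small samples the plug-in variances make the normal interval too short,
which is the effect reported above; with large samples the coverage moves
close to the nominal level.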

Theorem 4.3 suggests that, under a large sample setting, the DE based
approach and the Graybill-Deal estimator (equivalently, $H_{AN}$ or $H_P$)
based approach will have similar lengths for confidence intervals with high
asymptotic coverage. We carried out a simulation study to compare the lengths
in the two cases with large sample size $(n_1, n_2) = (100, 140)$, at
confidence level 95%. We found that the lengths corresponding to the $H_{DE}$
based method, on average, are slightly higher than those corresponding to the
Graybill-Deal estimator, but they are not too far apart. The average ratio of
the lengths, in the 1000 simulations, is 1.034 for
$(\sigma_1^2, \sigma_2^2) = (1, 1.5^2)$ and 1.081 for
$(\sigma_1^2, \sigma_2^2) = (1, 3.5^2)$. Similar ratios were also obtained for
the 90% and 99% confidence intervals under the same setting. The simulation
results seem to endorse our recommendation at the end of Section 4.

5.2. Adaptive combination of odds ratios in ulcer data. Efron [(1993),
Section 5] studied an example of combining independent studies. The example
concerns a randomized clinical trial studying a new surgical treatment
for stomach ulcers [Kernohan, Anderson, McKelvey and Kennedy (1984)] in
which there are 9 successes and 12 failures among 21 patients in the
treatment group, and 7 successes and 17 failures among 24 patients in the
control group. The parameter of interest is the log odds ratio of the
treatment. Based on the data, the estimated log odds ratio is
$\hat\theta_1 = \log\!\left(\frac{9}{12} \big/ \frac{7}{17}\right) = 0.600$,
with estimated standard error
$\hat\sigma_1 = \left(\frac{1}{9} + \frac{1}{12} + \frac{1}{7} +
\frac{1}{17}\right)^{1/2} = 0.629$. In addition to Kernohan's trial, there
were 40 other randomized trials of the same treatment between 1980 and 1989
[see Table 1 of Efron (1996) for the complete data]. The question of interest
is how to combine the information in these 40 studies with that in
Kernohan's trial. Efron (1993) employed an empirical Bayes approach, where he
used a Bayes rule to combine the implied likelihood function of Kernohan's
trial $L^*_x(\theta) \approx \frac{1}{\hat\sigma_1}
\phi\!\left(\frac{\theta - \hat\theta_1}{\hat\sigma_1}\right)$ with a prior
distribution $\pi_e(\theta) \propto \sum_{j=2}^{41} \frac{1}{\hat\sigma_j}
\phi\!\left(\frac{\theta - \hat\theta_j}{\hat\sigma_j}\right)$. Here
$\phi(t)$ is the density function of the standard normal distribution, and
$\hat\theta_j$ and $\hat\sigma_j$, $j = 2, \ldots, 41$, are the estimators of
the log odds ratios and standard errors in the 40 other clinical trials.
To obtain meaningful estimates of $\theta_j$ and $\sigma_j$ in the analysis,
nine entries of zero were changed to 0.5; see Efron (1993).
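
The two summary numbers for Kernohan's trial follow directly from the 2 x 2
table of counts. The short Python sketch below reproduces them; the function
name and argument names are our own and are used only for illustration.

```python
import math


def log_odds_ratio(succ_trt, fail_trt, succ_ctl, fail_ctl):
    """Estimated log odds ratio and its usual large sample standard error."""
    theta_hat = math.log((succ_trt / fail_trt) / (succ_ctl / fail_ctl))
    sigma_hat = math.sqrt(1 / succ_trt + 1 / fail_trt + 1 / succ_ctl + 1 / fail_ctl)
    return theta_hat, sigma_hat


# Kernohan's trial: 9 successes and 12 failures (treatment),
# 7 successes and 17 failures (control).
theta1, sigma1 = log_odds_ratio(9, 12, 7, 17)
print(f"log odds ratio = {theta1:.3f}, standard error = {sigma1:.3f}")
```

Both formulas require nonzero counts, which is why zero entries in the other
trials were replaced by 0.5 before computing the estimates.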

We re-study this example, utilizing the purely frequentist CD combination
approach. Under the standard assumption that the data in each of these 41
independent clinical trials are from a four-category multinomial
distribution, it is easy to verify that
$H_j(y) = \Phi\!\left(\frac{y - \hat\theta_j}{\hat\sigma_j}\right)$,
$j = 1, 2, \ldots, 41$, are a set of first-order normal aCDs of the 41
clinical trials. We use the combined aCD $H_{AN}$ (i.e., taking
$g_c(u_1, \ldots, u_L) = \sum_{i=1}^{L} \left[\frac{1}{\hat\sigma_i}
\Phi^{-1}(u_i)\right]$ in (3.1)), both with and without adaptive weights, to
summarize the combined information. Although there is no way to theoretically
compare our approach with Efron's empirical Bayes approach, we will discuss
the similarities and differences of the final outcomes from these two
alternative approaches.

First, let us temporarily assume that the underlying values of $\theta$ in
these 41 clinical trials are all the same. So, each trial receives the same
weight in combination. In this case, the combined aCD is
$H^S_{AN}(\theta) = \Phi\!\left(\left\{\sum_{i=1}^{41}
\frac{\theta - \hat\theta_i}{\hat\sigma_i^2}\right\} \Big/
\left\{\sum_{i=1}^{41} \frac{1}{\hat\sigma_i^2}\right\}^{1/2}\right)
= \Phi(7.965(\theta + 0.8876))$. The density curve of $H^S_{AN}(\theta)$ is
plotted in Figure 2(a), along with the posterior density curve (dashed line)
obtained from Efron's empirical Bayes approach. For easy illustration, we also
include (in each plot) two dotted curves that correspond to the aCD density of
Kernohan's trial $h_1(\theta) = H_1'(\theta)$ and the average aCD densities of
the previous 40 trials
$g_a(\theta) = \frac{1}{40} \sum_{i=2}^{41} \frac{1}{\hat\sigma_i}
\phi\!\left(\frac{\theta - \hat\theta_i}{\hat\sigma_i}\right)$; note that
$h_1(\theta) \approx L^*_x(\theta)$, Efron's (1993) implied likelihood, and
$g_a(\theta) \propto \pi_e(\theta)$, the empirical prior used in Efron (1993).
It is clear in Figure 2(a) that the aCD curve of $H^S_{AN}(\theta)$ is too far
to the left, indicating a lot of weight has been given to the 40 other trials.
We believe that the assumption of the same underlying values of $\theta$ in
all of these 41 clinical trials is too strong; see also Efron (1996).

A more reasonable assumption is that some of the 40 other trials may
not have the same underlying true $\theta$ as in Kernohan's trial. It is
sensible to use the adaptive combination methods proposed in Section 3.2,
which downweight or exclude the trials with the underlying parameter value
away from that of Kernohan's trial. Three sets of adaptive weights are
considered: one set of normal kernel weights, with $\omega_j$ proportional to
the normal density $\phi(\cdot)$ evaluated at a scaled difference between
$M_1$ and $M_j$, and two sets of 0 or 1 adaptive weights
$\omega_j = \mathbf{1}_{(I_1 \cap I_j \neq \emptyset)}$ with
$I_i = (H_i^{-1}(\alpha_n/2), H_i^{-1}(1 - \alpha_n/2))$. Here
$M_i = H_i^{-1}(\frac{1}{2})$ is the median of the aCD $H_i$ of the $i$th
trial, and following Remark 3.1, we take $\alpha_n = 0.25$ and $0.30$,
respectively, in the two sets of 0 or 1 weights. The three corresponding
combined CDs are, respectively, $\Phi(5.7788(\theta - 0.1029))$,
$\Phi(5.4007(\theta - 0.1199))$ and $\Phi(5.3051(\theta - 0.1698))$. Their
density curves are plotted in Figure 2(b)-(d). We also tried triangle and
rectangular kernel weights, but the results were very similar and are not
presented here. A noteworthy feature of the combined CD density curves is the
following: when all the weights are equal, the combined curve puts little
mass to the right of 0, while all the rest put substantial mass to the right
of 0.

Fig. 2. The solid curves in (a)-(d) are combined CD density curves, combined
with (a) equal weight to all 41 trials, (b) normal kernel weights, (c) 0-1
weights with $\alpha_n = 0.25$, (d) 0-1 weights with $\alpha_n = 0.30$. The
dashed curves are the posterior density function (approximated) from Figure 4
of Efron (1993). The two dotted curves (with peaks from right to left)
correspond to the aCD density of Kernohan's trial $h_1(\theta)$ [i.e., Efron's
(1993) implied likelihood $L^*_x(\theta)$] and the average aCD densities of
the 40 other trials [proportional to the empirical prior $\pi_e(\theta)$ used
in Efron (1993)].
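
The 0 or 1 adaptive weights can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the code used for the analysis:
the weight is taken to be 1 when the central $(1 - \alpha_n)$ interval of a
trial's normal aCD overlaps that of the reference trial, and apart from
Kernohan's trial $(0.600, 0.629)$ the estimates listed below are hypothetical
placeholders rather than values from Efron's table.

```python
from statistics import NormalDist


def acd_interval(theta_hat, sigma_hat, alpha_n):
    """(H^{-1}(alpha_n/2), H^{-1}(1 - alpha_n/2)) for the normal aCD of one trial."""
    d = NormalDist(mu=theta_hat, sigma=sigma_hat)
    return d.inv_cdf(alpha_n / 2.0), d.inv_cdf(1.0 - alpha_n / 2.0)


def zero_one_weights(estimates, alpha_n):
    """Weight 1 if a trial's interval overlaps that of the first (reference) trial."""
    lo1, hi1 = acd_interval(*estimates[0], alpha_n)
    weights = []
    for theta_hat, sigma_hat in estimates:
        lo, hi = acd_interval(theta_hat, sigma_hat, alpha_n)
        weights.append(1 if (lo < hi1 and lo1 < hi) else 0)
    return weights


# First entry: Kernohan's trial. Remaining entries are hypothetical.
estimates = [(0.600, 0.629), (0.1, 0.3), (-1.5, 0.4), (-0.2, 0.5)]
print(zero_one_weights(estimates, alpha_n=0.25))
print(zero_one_weights(estimates, alpha_n=0.30))
```

A larger $\alpha_n$ gives narrower intervals and therefore excludes more
trials, which is the role the tuning constant plays in the two 0-1 weighting
schemes above.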

Comparing the three adaptively combined CDs with the posterior distribution
obtained by Efron [(1993), Figure 4] on the same data set [Figure 2(b)-(d)],
we find that they all have very similar features. Their density curves all
peak within a small range and between the peaks of $h_1(\theta)$ and
$g_a(\theta)$, actually much closer to $h_1(\theta)$, reflecting the intuitive
intent of such combinations. But there is also a quite notable difference. The
spans of all the combined CD densities are smaller than that of the posterior
density function. Note that in Efron's empirical Bayes approach, all 40 other
trials have equal contributions (as a part of the prior) to the final
posterior through Bayes formula. In the adaptive approach, the trials closer
to Kernohan's trial have more contribution (i.e., higher weights) than those
trials farther away. It seems that much more information from the 40 other
clinical trials, especially those with $H_j(\theta)$ closer to $H_1(\theta)$,
has been drawn in the adaptive CD combination method.

5.3. Computationally intense methodology on a large data set. One can
utilize CDs to find a way to apply statistical methodology involving heavy
computations on a large data set. Here, we illustrate the "split and combine"
approach. We divide the data into smaller data sets; after analyzing each
sub-data set separately, we can piece together useful information through
combining CDs. For a computationally intense methodology, such a method
can result in tremendous saving. Suppose the number of steps involved in
a statistical methodology is $c n^{1+a}$, $n$ being the size of the data set,
$a > 0$. Suppose the data set is divided into $k$ pieces, each of size $n/k$.
The number of steps involved in carrying out the method on each subset is
$c(n/k)^{1+a}$. Thus, the total number of steps is
$ck(n/k)^{1+a} = c n^{1+a}/k^{a}$. If the effort involved in combining CDs is
ignored, there is a saving by a factor of $k^{a}$. We think that the
information loss due to this approach will be minimal. One question is
how to divide the data. Simple approaches include dividing the data based
on their indices (time or natural order index), random sampling or some
other natural groupings.

For the purpose of demonstration, let us consider a U-statistic based robust
multivariate scale proposed by Oja (1983). Let $\{X_1, \ldots, X_n\}$ be a
two-dimensional data set. Oja's robust multivariate scale is
$S_n = \mathrm{median}\{\text{areas of all triangles formed by 3 data
points}\}$. For any given three data points $(x_1, y_1)$, $(x_2, y_2)$ and
$(x_3, y_3)$, the area of their triangle is given by
$\frac{1}{2} |\det(t_1\, t_2\, t_3)|$, where $t_l = (1, x_l, y_l)'$, for
$l = 1, 2, 3$. To make inference on this scale parameter, it is natural to use
the bootstrap. But obtaining the bootstrap density of $S_n$ is a formidable
task when the sample size $n$ is large. For example, even with $n = 48$ the
computation of $S_n$ involves evaluating the area of
$48 \times 47 \times 46/6 = 17{,}296$ triangles. With 5000 (say) repeated
bootstrapping, the total number of triangle areas needed to be evaluated is
86.48 million. If one adopts the "split and combine" approach discussed above,
say, randomly splitting the data set of size 48 into two data sets of size 24,
the number of triangle areas needed to be evaluated is
$2 \times 5000 \times (24 \times 23 \times 22/6) = 20.24$ million. This is
less than 1/4 of the 86.48 million. If we randomly split the data set of size
48 into three data sets of size 16 each, the number of triangle areas needed
to be evaluated is $3 \times 5000 \times (16 \times 15 \times 14/6) = 8.4$
million, less than 1/10 of the 86.48 million. Since bootstrap density
functions are aCDs, the bootstrap density functions obtained from each
sub-dataset can be combined together, using the techniques of combining CDs.
The combined CD can be used to make inferences on the robust multivariate
scale $S_n$.
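
The scale statistic and the operation counts quoted above can be sketched as
follows. This Python fragment is illustrative only: the function names are our
own, the brute-force enumeration of triangles is a direct reading of the
definition rather than an efficient algorithm, and the four test points are
hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def triangle_area(p1, p2, p3):
    """Half the absolute determinant of the matrix with columns (1, x_l, y_l)'."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    det = (x2 * y3 - x3 * y2) - (x1 * y3 - x3 * y1) + (x1 * y2 - x2 * y1)
    return 0.5 * abs(det)


def oja_scale(points):
    """Median area over all triangles formed by three of the data points."""
    return median(triangle_area(a, b, c) for a, b, c in combinations(points, 3))


def bootstrap_triangle_count(n, pieces=1, boot=5000):
    """Triangle areas evaluated when n points are split into equal pieces."""
    return pieces * boot * math.comb(n // pieces, 3)


print(oja_scale([(0, 0), (1, 0), (0, 1), (1, 1)]))
for k in (1, 2, 3):
    print(k, bootstrap_triangle_count(48, pieces=k))
```

The counts for one, two and three pieces correspond to the 86.48, 20.24 and
8.4 million triangle areas discussed above.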

Figure 3 plots bootstrap density functions of the robust multivariate scale
$S_n$ based on a simulated two-dimensional data set of size 48. The data set
was generated with the $s$th observation being
$(z_s^{[1]} + z_s^{[2]}, z_s^{[1]} - z_s^{[2]})$, where $z_s^{[1]}$ and
$z_s^{[2]}$, $s = 1, \ldots, 48$, are simulated from the Cauchy distributions
with parameters center = 0 and scale = 1 and center = 1 and scale = 1.3,
respectively. The solid curve in Figure 3(a) is the bootstrap density function
of $S_n$ based on 5000 bootstrapped samples. It took 67,720.75 seconds to
generate the density curve on our Ultra 2 Sparc Station using Splus. The
dotted and dashed curves are the bootstrap density functions of $S_n$ using
the "split and combine" method, where we split the sample randomly into two
and three pieces, respectively. It took 12,647.73 and 5561.19 seconds to
generate these two density plots, including the combining part. Hence, the
"split and combine" method used less than 1/4 and 1/10 of the time,
respectively. From Figure 3, it is apparent that all three curves are quite
alike. They all seem to capture essential features of the bootstrap
distribution of the robust multivariate scale $S_n$.

Fig. 3. Figure (a) is for density curves and (b) is for cumulative
distribution curves. The solid, dotted and dashed-line curves correspond to
methods described in the main context with no split, split into two pieces and
split into three pieces, respectively.



APPENDIX 

A.1. Proofs in Section 3.



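Proof of Lemma 3.1. Let $T$ and $W$ be independent random variables, such
that $T$ has the standard double exponential distribution while $W$ satisfies,
for $t > 0$, $P(W \le -t) = P(W \ge t) = \frac{1}{2} V_k(t) e^{-t}$, where
$V_k(\cdot)$ is a polynomial of finite degree. We write, for $t > 0$,
$P(T + W > t) = \frac{1}{4} P(T + W > t \mid T > 0, W > 0)
+ \frac{1}{4} P(T + W > t \mid T > 0, W < 0)
+ \frac{1}{4} P(T + W > t \mid T < 0, W > 0) = I + II + III$ (say). Now,

$$I = \tfrac{1}{4} P(W > t \mid T > 0, W > 0)
    + \tfrac{1}{4} P(T + W > t, W \le t \mid T > 0, W > 0)$$
$$\;\; = \tfrac{1}{4} V_k(t) e^{-t}
    + \tfrac{1}{4} \int_0^t e^{-(t-s)} [V_k(s) - V_k'(s)] e^{-s}\, ds
    = \tfrac{1}{4} e^{-t} \left[ V_k(t) + \int_0^t [V_k(s) - V_k'(s)]\, ds \right],$$

$$II = \tfrac{1}{4} P(T > t - W \mid T > 0, W < 0)
     = \tfrac{1}{4} P(T > t + W \mid T > 0, W > 0)$$
$$\;\; = \tfrac{1}{4} \int_0^{\infty} e^{-(t+s)} [V_k(s) - V_k'(s)] e^{-s}\, ds
     = \tfrac{1}{4} e^{-t} \int_0^{\infty} [V_k(s) - V_k'(s)] e^{-2s}\, ds.$$

Similarly, $III = \frac{1}{4} e^{-t} \left[\int_0^{\infty} V_k(s + t) e^{-2s}\,
ds\right]$. The proof is concluded by letting $W$ have the distribution of
$\sum_{i=1}^{k} Y_i$, for $k = 1, 2, \ldots$, successively. □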

Before we prove Theorem 3.2, we first prove a lemma. This lemma borrows 
an idea from Littell and Folks (1973). 

Lemma A.1. Let $H_c(u_1, \ldots, u_L)$ be a function from $[0,1]^L$ to
$[0,1]$, monotonically nondecreasing in each coordinate. Also, suppose
$H_c(U_1, \ldots, U_L)$ has the $U(0,1)$ distribution when
$U_1, \ldots, U_L$ are independent $U(0,1)$ random variables. Then for any
$u_1, \ldots, u_L$ in $[0,1]$,
$H_c(u_1, \ldots, u_L) \ge \prod_{l=1}^{L} u_l$ and
$1 - H_c(u_1, \ldots, u_L) \ge \prod_{l=1}^{L} (1 - u_l)$.

Proof. In view of the monotonicity, it follows that
$\{U_1 \le u_1, U_2 \le u_2, \ldots, U_L \le u_L\}$ implies
$\{H_c(U_1, \ldots, U_L) \le H_c(u_1, \ldots, u_L)\}$. The first claim
follows if we take $U_1, U_2, \ldots, U_L$ as independent $U(0,1)$ random
variables. The second claim can be proved similarly via the facts that
$\{U_1 \ge u_1, \ldots, U_L \ge u_L\}$ implies
$\{1 - H_c(U_1, \ldots, U_L) \le 1 - H_c(u_1, \ldots, u_L)\}$ and that
$1 - H_c(U_1, \ldots, U_L)$ follows the $U(0,1)$ distribution when
$U_1, \ldots, U_L$ are independent $U(0,1)$ random variables. □
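
The two inequalities of Lemma A.1 can be spot-checked on a specific combining
function. The Python sketch below uses the equally weighted normal
combination $H_c(u_1, \ldots, u_L) = \Phi(L^{-1/2} \sum_i \Phi^{-1}(u_i))$,
which is nondecreasing in each coordinate and uniformly distributed under
independent uniform inputs; the choice of this function, the grid of values
and the function names are our own assumptions for illustration.

```python
from itertools import product
from statistics import NormalDist

_N = NormalDist()


def h_c_normal(us):
    """Phi(sum_i Phi^{-1}(u_i) / sqrt(L)) for inputs strictly inside (0, 1)."""
    L = len(us)
    return _N.cdf(sum(_N.inv_cdf(u) for u in us) / L ** 0.5)


def lemma_a1_holds(us):
    """Check H_c(u) >= prod u_l and 1 - H_c(u) >= prod (1 - u_l)."""
    prod_u, prod_1mu = 1.0, 1.0
    for u in us:
        prod_u *= u
        prod_1mu *= 1.0 - u
    hc = h_c_normal(us)
    return hc >= prod_u and 1.0 - hc >= prod_1mu


grid = (0.05, 0.2, 0.5, 0.8, 0.95)
all_ok = all(lemma_a1_holds(us) for L in (2, 3) for us in product(grid, repeat=L))
print(all_ok)
```

A grid check of this kind does not replace the proof; it only illustrates the
bounds that drive the slope comparison in Theorem 3.2.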

Proof of Theorem 3.2. In Lemma A.l take u\ = Hi(x), . . . , ul = 
Hl{x). The first result follows immediately from Lemma A.l. Note m/n\ — > 

Aj, where m = n\ + n2-\ h til- The second result follows from the first 

result. 

The next two equalities related to $H_{E1}$ and $H_{E2}$ can be obtained from
direct calculations, appealing to the fact that the upper tail area of the
$\chi^2_{2L}$-distribution satisfies
$\lim_{y \to +\infty} \frac{1}{y} \log P(\chi^2_{2L} > y) = -\frac{1}{2}$,
where $\chi^2_{2L}$ is a $\chi^2_{2L}$-distributed random variable. Note, by
Lemma 3.1, $\lim_{y \to +\infty} \frac{1}{y} \log DE_L(-y) =
\lim_{y \to +\infty} \frac{1}{y} \log(1 - DE_L(y)) = -1$. Using this, it is
seen that the last two claims also hold. □

The proof of Theorem 3.5 critically depends on the following lemma. 

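Lemma A.2. Under the condition of Theorem 3.5, with $\omega^*$ as in (3.5),
one has

$$\sup_t |G_{c,\omega^*}(t) - G_{c,\omega^{(0)}}(t)| \to 0 \quad
\text{in probability},$$

where $\omega^{(0)} = (1, \omega_2^{(0)}, \ldots, \omega_L^{(0)})$,
$\omega_i^{(0)} = 1$ if $H_i \in \mathcal{H}_0$ and $\omega_i^{(0)} = 0$ if
$H_i \in \mathcal{H}_1$.

Proof. Let $Y = (Y_1, \ldots, Y_L)$ be i.i.d. r.v.'s having the distribution
$F_0$, independent of the original data. Clearly
$\omega_i^* \to \omega_i^{(0)}$ in probability, for $i = 2, \ldots, L$. Note
that when $|\omega_i^* - \omega_i^{(0)}| < \delta$, for a $\delta > 0$ and all
$i = 2, \ldots, L$, we have

$$P_Y\!\left(Y_1 + \sum_{i=2}^{L} \omega_i^{(0)} Y_i
  + \delta \sum_{i=2}^{L} |Y_i| \le t\right)
\le P_Y\!\left(Y_1 + \sum_{i=2}^{L} \omega_i^{*} Y_i \le t\right)
\le P_Y\!\left(Y_1 + \sum_{i=2}^{L} \omega_i^{(0)} Y_i
  - \delta \sum_{i=2}^{L} |Y_i| \le t\right).$$

Using standard arguments, one deduces that, as $|\delta| \to 0$,

$$\sup_t \left| P_Y\!\left(Y_1 + \sum_{i=2}^{L} \omega_i^{(0)} Y_i \le t\right)
- P_Y\!\left(Y_1 + \sum_{i=2}^{L} \omega_i^{(0)} Y_i
  + \delta \sum_{i=2}^{L} |Y_i| \le t\right) \right| \to 0.$$

The lemma follows from combining the above assertions. □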

Proof of Theorem 3.5. For Hi e Ho, supi a ._ 0o i< 5n | x F _1 (i/j(x)) 

using the condition on S n . For Tfj G 7i\ and |x — 6*o | < S n , min{llj(x),l — 
-ffi(x)} tends to at an exponential rate. Therefore, F Q ~ 1 (Hi(x)) = 0(n), 
since the tails of Fq decay at an exponential rate as well. Using the assumed 
condition on b n and the kernel function, we deduce that cd* — > in proba- 
bility, faster than any polynomial rate, for i such that Hi E 7i\. Thus, for 
such an i, ^P\ y -0 o \<s n l^i (Hi( v ))\ — > in probability. The theorem now 
follows, utilizing Lemma A. 2. □ 

A.2. Proofs in Section 4.



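Proof of Theorem 4.1. For simplicity, we assume
$\Theta = (-\infty, +\infty)$; other situations can be dealt with similarly.
Note we only need to prove that $H_P(\theta_0)$ is $U(0,1)$ distributed, where
$\theta_0$ is the true parameter value. Throughout, $\tilde h_j(\cdot)$
denotes the density of $T_j - \theta_0$.

Define an $L - 1$ random vector $Z = (Z_1, Z_2, \ldots, Z_{L-1})^T$, where
$Z_j = T_j - T_L$, for $j = 1, 2, \ldots, L - 1$. So the joint density
function of $Z$ is
$f_Z(z) = f_Z(z_1, z_2, \ldots, z_{L-1}) = \int_{-\infty}^{+\infty}
\prod_{j=1}^{L-1} \tilde h_j(z_j + u)\, \tilde h_L(u)\, du$ and the
conditional density of $T_L - \theta_0$, given $Z$, is
$f_{T_L \mid Z}(t) = \prod_{j=1}^{L-1} \tilde h_j(Z_j + t)\,
\tilde h_L(t) / f_Z(Z)$. Also, for each given $Z$, we define a decreasing
function $K_Z(\gamma) = \int_{\gamma}^{+\infty} \prod_{j=1}^{L-1}
\tilde h_j(Z_j + u)\, \tilde h_L(u)\, du$. It is clear that

$$H_P(\theta_0) = K_Z(T_L - \theta_0)/f_Z(Z).$$

So for any $s$, $0 < s < 1$, we have

$$P\{H_P(\theta_0) \le s\} = P\{T_L - \theta_0 \ge K_Z^{-1}(s f_Z(Z))\}$$
$$= E[P\{T_L - \theta_0 \ge K_Z^{-1}(s f_Z(Z)) \mid Z\}]$$
$$= E\!\left[\int_{K_Z^{-1}(s f_Z(Z))}^{+\infty}
   \frac{\prod_{j=1}^{L-1} \tilde h_j(Z_j + t)\, \tilde h_L(t)}{f_Z(Z)}\, dt\right]$$
$$= E\!\left[\int_0^{s f_Z(Z)} \frac{1}{f_Z(Z)}\, du\right] = s,$$

where the fourth equality is due to a monotonic variable transformation in
the integration: $u = K_Z(t)$. □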

Proof of Theorem 4.3. Let $\theta_0$ be the common value of the parameter.
To prove the claim, we show that for any $\epsilon > 0$,
$P(H_{DE}(\theta_0 + (1 + \epsilon)\ell_{AN}(\alpha)) > 1 - \alpha) \to 1$,
$P(H_{DE}(\theta_0 + (1 - \epsilon)\ell_{AN}(\alpha)) < 1 - \alpha) \to 1$, and
similar results on the lower side of $\theta_0$. We establish the first claim
below; others can be proved similarly.

Let us note that, under $\lim^*$,



J2 DE- 1 ( $ f ^ [0 O + (1 + e)e AN (a) - T t ] 
i=i V V T « 

= £ DE- 1 U f^i[(l + e )^(a)] [1 + o p (l)} 

= E kv^J(l + e)^(a)][l + o P (l)]/^} 2 = [(1 + e) 2 z 2 j2] [1 + 

Thus, by Lemma 3.1, 

$1 - H_{DE}(\theta_0 + (1 + \epsilon)\ell_{AN}(\alpha)) = o_p(\alpha^{1+\epsilon})$. □